Assessment of a Significant Arabic Corpus
Authors
Abstract
The development of Language Engineering and Information Retrieval applications for Arabic requires the availability of sizeable, reliable corpora of modern Arabic text. These are not routinely available. This paper describes how we constructed an 18.5 million word corpus from Al-Hayat newspaper text, with articles tagged as belonging to one of 7 domains. We outline the profile of the data and how we assessed its representativeness. The literature suggests that the statistical profile of Arabic text is significantly different from that of English in ways that might affect the applicability of standard techniques. The corpus allowed us to verify a collection of experiments which had, so far, only been conducted on small, manually collected datasets. We draw some comparisons with English and conclude that there is evidence that Arabic data is much sparser than English for the same data size.
Similar resources
A New Method for Extracting Named Entities in Classical Arabic
In Natural Language Processing (NLP) studies, developing resources and tools makes a contribution to extension and effectiveness of researches in each language. In recent years, Arabic Named Entity Recognition (ANER) has been considered by NLP researchers due to a significant impact on improving other NLP tasks such as Machine translation, Information retrieval, question answering, query result...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Optical Coherence Tomography and Corpus Callosum Index in Cognitive Assessment of Multiple Sclerosis Patients
Background: Multiple Sclerosis (MS) is a neurodegenerative disease of central nervous system. Different approaches have been developed to study MS progression and cognitive dysfunction as the major symptom of the disease. The current study compared Optical Coherence Tomography (OCT) and Corpus Callosum Index (CCI) for the early evaluation of cognitive dysfunction in MS patients. Objectives: T...
Assessment of epididymal sperm obtained from dromedary camel
Testicles were isolated from dromedary camels in a local slaughterhouse at breeding and non-breeding seasons. Sperms were recovered from different parts of the epididymis (caput, corpus and cauda) and stained separately on slide glasses by eosin nigrosin staining method and dried by a hair dryer and carried to the laboratory. In the lab, slides were observed for evaluation of the proportion of ...
Quality Assessment of General and Categorized Arabic Text Corpora
Many Natural Language Processing and Information Retrieval methods are based on the extensive use of text corpora. The credibility of the results can be heavily influenced by the underlying corpus quality. Much research has been utilizing Arabic corpora into various tasks of Arabic Information Processing. In this paper we discuss a suite of metrics that can be used to ascertain the quality of A...